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Abstract 

We build a complete and efficient ZK toolchain that handles proof 
statements encoded as arbitrary ANSI C programs. 

Zero-Knowledge (ZK) proofs are foundational in cryptography. Recent 
ZK research has focused intensely on non-interactive proofs of small state- 
ments, useful in blockchain scenarios. We instead target large statements 
that are useful, e.g., in proving properties of programs. 

Recent work (Heath and Kolesnikov, CCS 2020 [HK20a]) designed a 
proof-of-concept ZK machine (ZKM). Their machine executes arbitrary 
programs over a minimal instruction set, authenticating in ZK the pro- 
gram execution. In this work, we significantly extend this research thrust, 
both in terms of efficiency and generality. Our contributions include: 


e A rich and performance-oriented architecture for representing arbi- 
trary ZK proofs as programs. 


e A complete compiler toolchain providing full support for ANSI C95 
programs. We ran off-the-shelf buggy versions of sed and gzip, prov- 
ing in ZK that each program has a bug. To our knowledge, this is 
the first ZK system capable of executing standard Linux programs. 


e Improved ZK RAM. [HK20a] introduced an efficient ZK-specific 
RAM BubbleRAM that consumes O(log? n) communication per ac- 
cess. We extend BubbleRAM with multi-level caching, decreasing 
communication to O(log) per access. This introduces the possibil- 
ity of a cache miss, which we handle cheaply. Our experiments show 
that cache misses are rare; in isolation, i.e., ignoring other processor 
costs, BubbleCache improves communication over BubbleRAM by 
more than 8x. Using BubbleCache improves our processor’s total 
communication (including costs of cache misses) by ~ 25-30%. 

e Numerous low-level optimizations, resulting in a CPU that is both 
more expressive and ~% 5.5x faster than [HK20al’s. 

e Attention to user experience. Our engineer-facing ZK instrumenta- 
tion and extensions are minimal and easy to use. 

Put together, our system is efficient and general, and can run many 


standard Linux programs. The resultant machine runs at up to 11KHz 
on a 1Gbps LAN and supports MBs of RAM. 


1 Introduction 


Zero Knowledge (ZK) protocols allow a prover P to demonstrate to a verifier 
Y the truth of a given statement while revealing nothing additional. ZK proofs 
(ZKPs) are foundational cryptographic objects and have attracted wide research 
attention. ZK research originally focused on proofs of specific statements, but 
more recent works developed efficient techniques that handle proofs of arbitrary 
statements. Even with the pivot towards generality, the community’s focus has 
remained on succinct proofs of small statements, motivated largely by blockchain 
applications. In contrast, proofs of large, complex statements have been left 
relatively unexplored. 

Our focus: ZK for large programs. Recent work showed that ZK protocols 
can support a relatively efficient ZK processor [HK20a]. [HK20a] designed a 
new arithmetic ZK protocol and data structures, including a ZK-specific RAM 
BubbleRAM. They integrated these components into a processor that executes 
arbitrary programs in ZK, supports hundreds of KB of memory, and runs at 
2.1K Hz over a small instruction set. We continue [HK20a]’s research thrust into 
large proof statement support by improving both the ZK processor’s efficiency 
and the tooling available to programmers. 


1.1 Our Contribution 


We propose the first complete and concretely efficient ZK system for general C 
programs and libc!. Our system allows P and V to run a mutually agreed C 
program describing a proof statement via an interactive constant-round ZKP 
protocol. Our contribution comprises intertwined crypto-technical, architec- 
tural, and systems-technical components. We present: 


e An architecture for representing ZK proofs as programs. While circuit- 
based ZK is efficient even for large circuits (e.g., [HK20b, WYKW20]), 
circuits do not scale to programs with complex control flow. We design 
a complete architecture for high-performance CPU-emulation-based ZK, 
carefully balancing the cost and expressivity of each CPU step. Our ar- 
chitecture includes ZK-specific primitives; most notably, our prover oracle 
calls allow the programmer to effectively handle P’s input and to short- 
cut expensive algorithms that can be efficiently verified. We compare with 
previous ZK architectures [HK20a], [BCG*13] in Section 2. 


e Extensive compiler and library support. We fully support ANSI C95 pro- 
grams and libc. 


e Improved ZK RAM. ZK RAM allows the proof to efficiently look up an 
element from memory obliviously to V. | [HK20a] introduced an effi- 
cient ZK-specific RAM BubbleRAM that costs O(log? n) communication 
per access. We extend BubbleRAM with multi-level caching, decreasing 


1Both [HK20a] and [BCG*t 13] lack libc and cannot support off-the-shelf programs. They 
run, respectively, © 5.5x and 10000x slower than our work. 


communication to O(logn) per access. We call our caching RAM Bub- 
bleCache. BubbleCache introduces the possibility of a cache miss, which 
we handle cheaply. Our experiments show that cache misses are rare, 
and, ignoring other processor costs, BubbleCache improves communica- 
tion over BubbleRAM by more than 8x. Using BubbleCache improves 
our processor’s total communication (including costs of cache misses) by 
zx 25-30%. 


e Numerous low-level crypto optimizations as compared to [HK20a], such 
as a greatly improved ALU, tighter integration of instructions into the 
CPU circuit, and an efficient small table lookup technique. In sum, our 
non-memory based operations consume only 86 oblivious transfers, about 
a 4x improvement over [HK20a]. We also extend [HK20a]’s protocol with 
generalized vector-scalar multiplication. 


e User experience (UX). Our goal is to make ZK easy to use by engineers 
who are not trained in cryptography. We adhere to standard C, including 
native Linux I/O, user-input, and file-system operations, and require only 
that the programmer includes special QED instruction(s) as appropriate: 
reaching QED means a successful proof. 


e Use-cases drawn from existing popular Linux programs. We ran off-the- 
shelf, buggy versions of gzip and sed, proving in ZK the existence of 
bugs (CVE-2005-1228 and the SIR repository, respectively). The gzip 
bug consumes 44,092 CPU cycles and runs in 6.5s; the sed bug consumes 
390, 002 CPU cycles and runs in 36.1s. Each benchmark is ~ 5.4KLOC of 
C code. 


e We plan to open-source our project to the community. 


2 Related Work 


We compile programs written in high level C down to a low-level instruction set, 
and we evaluate instructions on an emulated ZK CPU. Our technique allows 
proofs of arbitrary statements, including statements that are large and complex. 
In our review of related work, we focus on concretely efficient ZK protocols and 
on works that pursue secure CPU-emulation. 

[HK20a] is the most relevant work, and we build on many of its techniques. 
Section 5 reviews their ZK protocol, which we build on. In this section, we com- 
pare to [HK20a] in several contexts: architecture, RAM, performance, support 
for general C. In sum, our work is better on all of these fronts. We highlight 
our cached RAM, our ability to run standard C programs, and our ~ 5.5x clock 
rate improvement. 


ZK ZK proofs [GMR85, GMW91] allow P to convince V, holding circuit C, 
that there exists an input, or witness, w for which C(w) = 1. Early practical 


ZK protocols, motivated by signatures and identification schemes, focused on 
algebraic relations. More recently, ZK research has shifted focus to proofs of 
arbitrary statements. 


ZK from garbled circuits (GC-ZK) and 2PC GC-ZK techniques encode 
ZK relations as garbled circuits [JKO13, FNO15, HK20b]. [JKO13] established a 
Garbling Scheme (GS)-based ZKP framework: by satisfying a few requirements, 
a new GS can be plugged into [JKO13]’s protocol to obtain malicious-verifier 
ZK. The [HK20a] protocol, which we leverage, is formalized in the [JKO13] 
framework. 

[HK20b] showed that GC-ZK can efficiently handle conditional branching. 
Their ‘stacked garbling’ technique yields communication that scales only with 
the longest execution path, not with the size of the circuit. While [HK20b] 
improves communication, their computation remains linear in the number of 
branches. Natural circuit-based handling of arbitrary control flow can result 
in significant blow-up in circuit size, and is often infeasible. Our work also 
scales only in the longest execution path, but does so by arranging the program 
into CPU steps. Because we change the program representation, we enjoy both 
communication and computation that scale linearly in the program’s execution 
time. We did not apply stacked garbling in our design. Our CPU is hand- 
optimized to aggressively amortize low-level operations. Porting stacking to the 
[HK 20a] protocol is not trivial and would not substantially improve our already 
lean CPU. 

Wolverine [WYKW20] recently improved over GC-based ZK by instead run- 
ning a maliciously-secure GMW protocol where VY has no input, allowing all 
GMW gates to be run in parallel. We use [HK20a]’s protocol which similarly 
runs all gates in parallel: our prover P simultaneously requests wire labels on 
each algebraic wire and then proves they are all related. In terms of cost, 
[HK20a]’s protocol is quite similar to Wolverine “per gate” for our desired se- 
curity and representation. Wolverine requires 2—4 field elements per arithmetic 
gate in a large field. To compute our basic operation, vector-scalar multiplica- 
tion of a vector with n field elements, we perform a single 1-out-of-2 OT of n 
field elements. Thus both we and Wolverine transmit a similar number of field 
elements per arithmetic gate. 

Unlike direct circuit representations, CPU-based techniques allow arbitrary 
control flow and scale to more realistic programs. Of course, CPU-based ZK is 
built on circuit protocols, so circuit protocol improvements remain an excellent 
direction for future work. 


ZK Processors and Architectures [BCTV14b, BCTV14a, BCG? 13] im- 
plement ZK processors via succinct non-interactive proof engines. In a sense, 
these approaches are more general than ours: our approach is interactive and 
requires linear communication. The trade-off is efficiency. These works yield 
processors that run in the 1Hz range. In contrast, our processor operates in the 
10KHz range. These works also introduced the TinyRAM architecture, which 


is similar in scope to our architecture: it allows proofs of arbitrary statements 
formulated as C programs. We build a custom architecture so as to extract as 
much performance as possible from the underlying protocol. As a result, our 
non-RAM components are cheap. TinyRAM CPU (without RAM operations) 
consumes % 1000 algebraic constraints per cycle, while ours consumes only 86 
OTs (OTs are comparable to constraints). 

[HK20a] implements a concretely efficient interactive ZK processor and is 
the closest to our work. Their focus is proof-of-concept, so they only support a 
small subset of C. In contrast, we comprehensively handle C programs. 

Our architecture provides ZK-specific features supported neither by [HK20a] 
nor by [BCG*13]. Most notably, we include instructions that we call prover 
oracle calls. Prover oracle calls allow the programmer to structure P’s input 
and also to use P to shortcut algorithms that are expensive to compute but 
efficient to verify. We view prover oracle calls as a crucial component for general 
ZK handling. Spice [SAGL18] used remote procedure calls (RPC), a mechanism 
similar to our prover oracle calls, to achieve similar goals. 

We mention, but do not discuss in detail, MPC processors [LO13, WGMK16, 
SHS*15]; their task is different and they are much less efficient than our ZK 
processor. 


Succinct and non-interactive ZK Our work is built on an efficient inter- 
active protocol with proof size linear in the program running time. Other works 
emphasize non-interactivity and/or succinct proofs. We review such works in 
Appendix A. In short, these works have excellent performance for small proof 
statements, but they struggle when handling larger proof statements. 


ZK RAM A key component in our processor is our ZK BubbleCache, an 
improvement on the state-of-the-art BubbleRAM [HK20a]. In our experiments 
(Section 9), BubbleCache is up to 8x more efficient than BubbleRAM (improve- 
ment depends on RAM size), which was already concretely efficient. Asymptoti- 
cally, BubbleCache improves BubbleRAM’s O(log” n) cost to O(log n), although 
it introduces the possibility of cache misses. Like BubbleRAM, BubbleCache 
does not use extra rounds of communication. 

[BCG*13] and a subsequent work [WSR*15] gave RAM-like constructions 
whereby P provides all RAM values as inputs and then, at the end of the 
protocol, permutes all read/write values and demonstrates consistency. The 
technique features high concrete costs: (1) P provides each value as extra input 
and (2) the consistency check requires checking that values are appropriately 
sorted. In contrast, BubbleRAM/BubbleCache read values directly from stor- 
age and feature an extremely cheap consistency check: on each access, P simply 
proves two values are equal, requiring amortized zero bytes of communication. 
More specifically, for a RAM of size n and for k RAM operations, Buffet’s 
[WSR*15] cost is (21+10logk+2logn) arithmetic constraints per RAM oper- 
ation. BubbleRAM incurs 3 log? n constraints (i.e., OTs) per RAM operation, 
and BubbleCache further improves on BubbleRAM as explained above. 


Spice [SAGL18] specified an O(1) cost key-value store. Their approach el- 
egantly improves over Merkle-tree based RAMs by maintaining two efficiently 
updatable hash digests: one for all reads and one for all writes. On an access, 
P provides the requested element as input, and the proof accordingly updates 
the two hashes. At the end of the proof, the two hashes are used to ‘audit’ 
all accesses, demonstrating that P honestly constructed each input. While the 
approach incurs amortized O(1) cost per access, the concrete costs are rela- 
tively high: their approach (and straightforward O(logn) Merkle-tree based 
ZK RAMs) require that the hash function be evaluated inside the ZK circuit. 
BubbleCache does not use non-black-box hash function calls. 


3 Notation and Assumptions 

3.0.1 Notation 
e We refer to our running system as ZKM: ZK machine. 
e a is the statistical security parameter (e.g., 40). 
e « is the computational security parameter (e.g., 128). 


e The prover is P. We refer to P by she, her, hers... 


The verifier is V. We refer to V by he, him, his... 


e We write £ to denote that the left hand side is defined to be the right 
hand side. 


We denote that x is uniformly drawn from a set S by x Eş S. 
e We denote the field of elements {0,1,...,p — 1} by Zp. 
e We denote the set of elements {1,2,...,p — 1} by Z. 
e We work with additive secret shares in a field Zp. We denote a sharing of a 
semantic value x € Zp by [x]. Sharings are formally defined in Section 5. 
3.0.2 Cryptographic Assumptions 


We use the recent and efficient Ferret oblivious transfer technique [YWL* 20]. 
Thus, we inherit its assumptions: (1) learning parity with noise (LPN), (2) 
a tweakable correlation-robust hash function, and (3) a random oracle (RO). 
If we assume access to an OT oracle, our technique is secure under standard 
assumptions. 
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Figure 1: The stack of abstractions and implementations in our system. Each 
implementation is built using the abstraction below it. At the top level, users 
write custom ANSI C programs. We implement ANSI C and libc via our cus- 
tom compiler and runtime library. Our compiler interfaces with our architec- 
ture, which is the high level design of our emulated processor; the architecture 
consists primarily of our instruction set. We refer to the implementation of 
our architecture as our microarchitecture. The microarchitecture consists of low 
level processor components and algorithms, such as our ALU and BubbleCache. 
Finally, our microarchitecture is implemented in terms of authenticated shares, 


(Section Vy 


realized by [HK20a]’s ZK protocol. 


4 Presentation Roadmap 


At a high level, we bridge end user programs and an underlying ZK protocol. 
We achieve this in two steps (see Figure 1). First, end user proof statements 
are expressed in high level C. These C programs are translated into ZK assembly 
by our custom compiler toolchain. The resultant assembly is specified by our 
architecture, the high level abstraction of our emulated processor. Second, we 
execute the assembly on our custom emulated processor. Our microarchitec- 
ture (the implementation of our architecture) consists of a number of low-level 
and algorithmic choices that emphasize efficiency. The microarchitecture is im- 
plemented on top of authenticated shares, provided by our generalization of 
[HK20a]’s protocol. 

Our presentation proceeds bottom-up through Figure 1’s abstraction stack, 
starting from the most primitive cryptographic objects and working up to the 
handling of end user programs: 


e Section 5 reviews [HK20a]’s authenticated share algebra as well as their 
cryptographic protocol. 


e Section 6 presents our architecture, including a high level description of our 
machine as well as its instruction set. We also present extended discussions 
of so-called prover oracle calls and of hardware interrupts. 


e Section 7 describes our microarchitecture with an emphasis on algorith- 
mic improvements, including our caching ORAM BubbleCache and our 
efficient ALU. 


e Section 8 shows how we integrated our processor into a user-facing system. 
We describe our C compiler and libc implementations, and we show how 
the parts of our system interoperate. 


The cross-cutting concern of the stack in Figure 1 is performance: many 
choices in our design are ultimately informed by the cost of operations in the 
authenticated share algebra. 

Section 9 concludes with experimental findings of the efficiency of our system. 


5 [HK20a]’s Authenticated Share Algebra 


Our processor is built on [HK20a]’s ZK protocol. [HK20a]’s protocol implements 
the primitives listed in Figure 2. 

In the protocol, P and V hold authenticated sharings of values in a field Zp 
for a suitably large prime p (we choose p = 2*° — 87, the largest 40 bit prime). 
An authenticated sharing consists of two shares, one held by V and one by P. 
We denote a sharing where V’s share is s € Zp and P’s share is t € Zp by writing 
(s,t). At the start of the protocol, V samples a non-zero global value A Eş Z% . 
Consider a sharing (X, xA — X) where X €¢ Zp is drawn by V. A sharing of 


Operation Communication (Bytes) 


[z] + [yl > [r +y] 0 
[z] — [yl > [z - y] 0 
biatge. EE el n 
cH [c] 0 
ziy] > lzy] 5- |g 
P proves [x] = [0] 0 (amortized) 
xe [a] 5 
Composite Decompose 32-bit word into bits 32-5 = 160 
Permute n m-word elements 5:m-n-logn 


Figure 2: ZK protocol operations and their communication cost. Primitive 
operations include (1) adding two sharings, (2) subtracting two sharings, (3) 
multiplying a sharing by a public constant, (4) encoding a public constant as 
a sharing, (5) multiplying a vector of sharings by a bit selected by P, and 
(6) checking that a sharing encodes zero. Above, c denotes a public constant 
and x outside of a share denotes a value chosen by P. Complex operations 
are composed from primitives; we list common composite operations and their 
costs. 


this form is a valid sharing of the semantic value x € Zp. We use the shorthand 
[x] to denote a valid sharing: 


[x] £ (X, xA — X) where X €g Zp 
Sharings have two key properties: 


1. V’s share gives no information about the semantic value. This holds triv- 
ially: V’s share is independent of x. 


2. P’s share is ‘unforgeable’: P cannot use xA — X to construct yA — X for 
y # x. This is because (1) both X and A are uniform and unknown to P, 
(2) the multiples of A are uniformly distributed over the field, and (3) the 
chosen prime is large enough to achieve our desired security: P can forge 

: : š 1: 1 
ly] only by guessing yA — X, which only succeeds with probability ST 


We review primitive operations on shares in Appendix B, including our sim- 
ple but very useful generalization to the protocol: a vector-scalar multiplication 
that generalizes [HK20a]’s share multiplication technique. The interface to and 
cost of protocol primitives are given in Figure 2. In short, linear operations over 
sharings are computed locally, but each vector-scalar multiplication requires 
communication in the form of a single oblivious transfer (OT). 


6 An Architecture for ZK Programs 


This section presents our ZK architecture. Specifically, we formalize our pro- 
cessor and its instruction set. 


We express ZK relations as high level C programs. Section 2 discussed that 
circuit-based ZK dominates in the literature, but that circuits do not scale to 
arbitrary control flow. We instead adopt a CPU emulation-, or ZK Processor- 
based architecture: P and VY jointly, authentically and obliviously to V execute 
a sequence of CPU steps that together evaluate the program. V accepts the 
proof if the program terminates in some distinguished state. Each CPU step is 
implemented by a circuit; we carefully manage these circuits to ensure correct 
execution of the program. 

Section 6.1 begins informally, summarizing the “platform” on which we build 
our architecture. Section 6.1 may be seen as crypto background for an archi- 
tecture person: it presents costs of basic functions and derives informal guide- 
lines for efficient operation. Then, we formalize our architecture in Sections 6.2 
and 6.3, giving the description of our ZK machine as well as its instructions. 
Sections 6.4 and 6.5 then proceed in detail on the more interesting parts of our 
architecture: prover oracle calls and hardware interrupts. 


6.1 ZK Processor Intuition and Efficiency Guidelines 


Ultimately, our ZK processor is built on simple algebra implemented by an effi- 
cient ZK protocol. While our current goal is to construct an abstract architec- 
ture, we do so keeping in mind the concrete efficiency properties that our imple- 
mentation ultimately inherits. Thus, we informally discuss basic primitives and 
pieces of intuition underlying our extensions and improvements over [HK20al, 
on which we build. In particular, we derive basic guidelines for processor de- 
sign. Costs discussed in this section are based on concrete choices of security 
parameters: 128-bit computational security and 40-bit statistical security. 

As a reminder, the protocol allows P and V to operate on authenticated shar- 
ings of semantic values that appear during the processor execution. Sharings 
have two crucial properties: 


1. Sharings are oblivious, meaning that they convey no information about 
the semantic values to V. 


2. Sharings are authentic, meaning that it is infeasible for P to create a share 
that is not computed by a procedure agreed upon by both V and P. 


6.1.1 Protocol Primitives 


The protocol allows P and V to manipulate integers in the field Z, for a large 
prime p. The protocol’s primitive operations over sharings include addition, 
subtraction, and a form of vector-scalar multiplication (see Figure 2 and Ap- 
pendix B). This vector-scalar multiplication works as follows: (1) P holds a 
private scalar x € {0,1}, (2) the parties input a shared vector [yj], and (3) the 
parties output the shared scaled vector [æy]. 

These primitives provide a starting point for implementing the entire proces- 
sor state machine, including reading and decoding instructions, reading /writing 
memory, and operating over processor registers. While addition and subtraction 
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are in a sense ‘free’ (i.e. require no communication), vector-scalar multiplication 
requires the parties to interact via oblivious transfer (OT).? OT cost includes 
fixed overhead: executing its basic component Random OT requires either (1) 
transmission of 16 bytes (« bits) with “traditional” techniques or (2) significant 
computation with the recent Silent OT works [BCGT19, YWL*20]. OT also 
includes variable overhead: 5 bytes (ø bits) are sent per vector element. It is the 
handling and transmission of these OTs which bottleneck performance, so op- 
timizing the processor mostly involves decreasing the processor’s multiplicative 
complexity. Thus, we arrive at our first guideline. 


Guideline 1. Minimize multiplications. 


We realize Guideline 1 by aggressively amortizing vector-scalar multiplica- 
tion. This amortization takes two forms: 


1. We seek optimizations that amortize fixed OT overhead by favoring small 
numbers of multiplications of long vectors over large numbers of multi- 
plications of short vectors. One extreme example is given in Section 7.3, 
where we implement small table lookup using a logarithmic number of 
vector-scalar multiplications. 


2. We seek optimizations that re-use the same non-linear operations to com- 
pute different values, perhaps with the help of additional linear operations. 
As a simple example, our processor computes both bit-wise AND and bit- 
wise OR each cycle. Rather than computing both operations separately, 
we use multiplication to compute only bit-wise AND, and then use linear 
operations to derive bit-wise OR from the result.? 


6.1.2 Data movement 


A significant portion of the work done by our processor involves conditionally 
moving information from one place to another, most notably in the processor’s 
RAM. Conditionally moving data can be implemented by multiplication. While 
the available multiplication primitive places a constraint on the scalar x (x must 
be 0 or 1), the vector ¥ is unconstrained (each index can hold an arbitrary Zp 
element). Thus, whether we store individual bits in ¥ or if we store entire words 
in y, the cost of multiplication remains the same. This leads to our second 
guideline: 
Guideline 2. Favor moving data as words rather than as bits. 

Our processor manipulates 32-bit words. Guideline 2 instructs us to rep- 
resent all 32 bits as a single sharing rather than separately sharing each bit, 


reducing the cost of data movements by 32x. As an additional benefit, the par- 
ties store data more cheaply: if parties store data as shares of bits, they incur 


2OT is a foundational cryptographic primitive that allows a receiver R to learn one of the 
sender S’s two secrets (1) without R learning the other secret and (2) without S learning R’s 
choice. 

3Let a,b € {0,1}. Then a ^b = ab and a V b = a + b — ab. 
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40x storage blow-up. Word-based storage requires only 40 bits of storage per 
player per word. 


6.1.3 Oblivious RAM 


RAM is a tricky subject in ZK: on each RAM access, V must not learn the 
queried index. The typical strategy for implementing such an oblivious RAM 
(ORAM) involves carefully permuting data such that is impossible for V to 
observe an access pattern. We can permute a large array by performing a 
large number of small data movements via a construction called a Waksman 
permutation network [Wak68]. Unfortunately, permuting n elements requires 
O(nlogn) vector-scalar multiplications: 


Guideline 3. Avoid costly permutations. 


To this end, we make use of [HK20a]’s BubbleRAM construction, a ZK spe- 
cific ORAM that carefully amortizes permutations. We take this yet a step 
further by augmenting BubbleRAM with a caching mechanism, a trick that 
reduces the amortized RAM lookup cost from O(log” n) to O(logn) (see Sec- 
tion 7.1). While our improved BubbleCache is much faster than BubbleRAM 
per access, it introduces the possibility of cache misses. In practice, we have 
found that cache misses are rare enough that their possibility is more than made 
up for by decreased access cost (cf section 9). 


6.1.4 Mixing Boolean and arithmetic 


Guideline 2, which instructs us to store data in words, has a downside: we 
often need to perform bitwise operations, such as AND or OR. Such operations 
require each of the value’s bits to be represented by a distinct sharing. In order 
to both follow Guideline 2 and allow bitwise operations, we need a mechanism 
that decomposes integer values into their constituent bits. Bit decomposition 
can be built with the help of two protocol primitives: (1) a primitive that allows 
Y and P to convert a constant c to a sharing |c] and (2) a primitive that allows 
P to prove to VY that a sharing holds zero. Note, P can convert an input bit to 
a sharing: if P has a bit of input æ € {0,1}, the parties multiply < - [1]. Now, 
we can implement bit decomposition: 


decompose : Zos2 — {0, 1}? 


The procedure is as follows: (1) P separately inputs each output bit (she knows 
the output bits in cleartext), (2) the parties use addition to compose the bits 
into a word, and (3) P proves that the composed word is equal to the input word 
by proving their difference is zero. This last step proves that P’s provided bit 
decomposition is valid, so the parties output those shares. Each decomposition 
requires P to input 32 bits, a relatively expensive operation: 


Guideline 4. Avoid bit decomposition. 
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Our implementation performs exactly two bit decompositions every cycle: in 
general, each algebraic instruction has two input registers, and we decompose 
the values in both registers. Moreover, we amortize bit decomposition multipli- 
cations with those needed by other operations (per Guideline 1). The two 32 
bit decompositions, requiring 64 OTs, are the most expensive subcomponent of 
our processor’s ALU. 

The above guidelines inform many of our architectural decisions, and we 
refer to them to motivate our choices. 


6.2 High Level Architecture 
We now formalize our architecture. 


Construction 1 (Architecture). Our Architecture is the abstract machine de- 
scribed in this section and in Section 6.8. 


Values manipulated by our architecture are 32-bit integer words. Construc- 
tion 1 consists of the following components: 


e A 32-bit program counter pc. 


e An instruction memory Z. Instructions are stored separately from data in 
a read-only memory (ROM).4 Separating instructions allows us to perform 
fewer permutation operations than if we had stored them together with 
data (Guideline 3), particularly because ROMs can be efficiently imple- 
mented in ZK [HK20a]. Note, the program (i.e., the content of the ROM) 
is publicly agreed on by P and V, though the access order must be kept 
secret. 


e A registry R with 32 registers that each hold a single word. Instructions 
operate directly on the registers. 


e A main memory M with 2* words of memory for custom k. Memory is 
word-addressable (as opposed to byte-addressable, see Guidelines 2 and 4). 
Specific LOAD and STORE instructions move data between R and M. 


e An input tape, holding P’s input values. 


A publicly agreed upon running time T. 


At initialization, (1) pc is set to zero, (2) each word in R and M are set 
to zero, (3) Z is loaded with the program, and (4) T is chosen. The processor 
then executes T cycles. On each cycle, the processor reads instruction Z[pc] 
and accordingly performs a combination of the following: 


4Thus our architecture is a ‘Harvard architecture’, as opposed to a more typical ‘von 
Neumann architecture’ in which instructions and data are stored together. When proving 
the existence of program bugs, it is possible that a Harvard architecture could exhibit bugs 
differently than a von Neumann architecture. We did not observe this as a practical issue 
when replicating bugs - see Section 9.2. 
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1. Load up to two values from R. 

2. Load a value from M. 

Load a value from the processor’s input tape. 
Perform an ALU operation on loaded values. 
Store a value in R. 


Store a value in M. 


Bs SO TONS SA ile 


Update pc. 


The specific actions corresponding to each instruction are given in Section 6.3. 
After T cycles, the machine loads one final instruction: if Z[pc] holds the distin- 
guished instruction QED, then the machine outputs 1 (i.e., V accepts the proof) 
and otherwise outputs 0 (i.e., reject). 

Our architecture also includes hardware interrupts: on an illegal memory 
access, the processor jumps to a distinguished program location. Section 6.5 
explains our interrupts. 


6.3 Our Instruction Set 


Figure 3 enumerates the syntax and semantics of the twenty instructions in 
our architecture. We organize our instructions into four general categories: (1) 
instructions that perform simple algebra over the registers, (2) instructions that 
allow arbitrary control flow, (3) instructions that manipulate main memory, 
and (4) instructions that collect P’s input. We mention several interesting 
instructions: 


e QED is our distinguished ‘proof’ instruction. P succeeds if she provides 
input that leads to QED. 


e HALT is our distinguished ‘fail’ instruction: since the processor is stuck on 
HALT, it is impossible to subsequently reach QED and complete a proof. 


e ORACLE manages P’s input if P is honest. Section 6.4 discusses ORACLE at 
length. 


e PC provides the minimal functionality required for handling procedure 
calls: before jumping to a procedure, the processor must store pc such 
that it can properly return. 


Our instruction set carefully balances expressivity with each CPU step’s mul- 
tiplicative complexity (see Guideline 1). Looking forward to our microarchitec- 
ture, our ALU, which implements our algebraic instructions, can be computed 
using only 65 OTs because of careful amortization (see Section 7.2). Other can- 
didate instructions, such as a bitwise left shift, would not fully amortize and 


14 


Syntax Semantics 
MOV tar {src} Ritar|] + val(src) 
l if 
CMOV tar srco {srci} | R[tar] + Uae) 3 Rise] A0 
R{tar], otherwise 
ADD tar srco {srci} | R{tar] + R[srco] + val(srcı)) 
SUB tar srco {src} | R[tar] + R[srco] — val(src,) 
MUL tar srco {srci} | R[tar] + R[srco] - val(srci) 
Aleebra XOR tar srco {srci} | R[tar] + R[srco] ® val(srcy) 
Bente AND tar srco {srci} | R[tar] — R[srco] A val(srcz) 
OR tar srco {srci} Ritar] — R[srco] V val(srcz) 
1, if R[src] =0 
EQZ tar src R\tar| — f 
0, otherwise 
1, if R[src] > 23! 
MSB tar src Rjtar| — f 
0, otherwise 
POW2 tar src Ritar] — 2Plerel 
JMP {dst} pe + val(dst) 
(dst), if R 0 
BNZ src {dst} pe + vase eee á 
Control pc+1, otherwise 
ontro’ PO tar {src} Ritar] — pe + val(src) ; pc 4+ pe+1 
HALT —no effect, pc unchanged — 
QED —no effect, pc unchanged — 
Mend: LOAD tar ao {a1} Ritar] — M[R]ao] + val(aı)] 
Y STORE src ao {a1} M([R[ao] + val(a,)] — R(src) 
INPUT tar R{tar] + x where x € {0..23? — 1} from P 
P Input 


ORACLE {id} 


val(x) {i 


Ria], 


P calls procedure val(id) ; pc +- pc +1 


if x is an immediate 


if x is a register id 


Figure 3: Our instruction set. Each instruction type handles between zero and 
three arguments. In general, arguments refer to registers, but some arguments, 
denoted {-}, can also optionally be immediates (i.e., compile-time constants). 
val is a helper function that resolves an argument that can be either a register 
or an immediate. Unless the semantics otherwise mention an effect on the pc, 
each instruction also increments the pc. 
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would thus require additional OTs. Since left shift is easily implemented by two 
instructions (POW2 and MUL), we omitted this instruction® and others. 


6.4 P’s Input and Prover Oracle Calls 


Cryptographers often view the problem of arranging P’s input as outside the 
concern of ZK: P “just knows” her input. In practice, such omniscience must be 
arranged: P’s input can be arbitrarily complex. We believe that, to provide a 
suitable interface, a ZK architecture should cleanly integrate the handling of P’s 
input. Our architecture provides this integration via a concept called a prover 
oracle call. 

Prover oracle calls (or just oracle calls) are initiated by our ORACLE instruc- 
tion. P subsequently (1) exits the ZK proof, (2) computes in cleartext an 
arbitrary procedure over the current processor state, (3) loads values onto the 
input tape, and (4) re-enters the proof. With the input tape properly loaded, 
the ZK program can issue INPUT instructions to word-by-word read from the 
tape. 

P maintains a cleartext table of oracle call procedures. At runtime, she 
interprets the ORACLE instruction’s argument as an index, looks up the corre- 
sponding procedure, and runs that procedure on the current processor state. 
We provide a suite of useful oracle calls, and the programmer can extend the 
table by dynamically linking custom code. 

At a high level, we use oracle calls for two primary tasks. 

First, we use oracle calls to interface with P’s cleartext system. E.g., in 
our libc implementation, when the ZK machine reads a file, P escapes the ZK 
proof, reads from the target file on her local machine, and loads the resulting 
bytes onto the input tape. The ZK machine then loads the bytes off the tape 
via INPUT. See Section 8 for more discussion. 

Second, we use oracle calls to shortcut procedures that are expensive to com- 
pute but efficient to verify. Given a function f with input x, it is not necessary 
for P and V to compute f(x) under the ZK protocol. Many ZK protocols take 
advantage of this fact. We make this capability available to end users via oracle 
calls: P computes locally and loads f(x) onto the input tape (this clairvoyance 
is why we call them ‘oracle calls’). Now, the parties need only verify that x 
and f(a) are related by f. Verification is sometimes far easier than evaluation. 
E.g., while sorting a list requires O(nlogn) operations, verifying that a given 
permutation sorts a list uses only O(n) operations (see Section 9.4). As another 
example, bitwise right shift is expensive to compute directly with our instruction 
set, but can be easily verified using bitwise operations and multiplication. 

After a ’shortcut’ oracle call is complete, it is essential that P’s inputs be 
verified: a malicious P can provide arbitrary inputs rather than following the 
specification of the oracle call. 

Because we provide ORACLE as an instruction, such behaviors can be changed 
and extended by the programmer. This said, we view oracle calls as a feature 


5The key difference between POW2 and left shift is that it is efficient to ensure that POW2 
will not overflow the prime field. 
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primarily for ‘power users’; users new to ZK can write ordinary C without using 
the feature or even knowing that it exists. 


6.5 Hardware Interrupts 


Programs can enter inconsistent states, such as when dereferencing an invalid 
memory location. Such errors cannot be handled directly by the program it- 
self. Computer systems address such scenarios via hardware interrupts, special 
function calls performed at the time of fault. We introduce interrupts in our 
architecture as well. In computer systems, as well as in ZK, interrupts can be 
used more generally: for example, for event- and timer-based interrupts. 

Currently, our architecture only issues an interrupt on an illegal memory 
access. In this case, the processor sets pc to a statically determined, constant 
value MEMFAULT. By default, index MEMFAULT holds the HALT instruction; a pro- 
gram that issues an illegal memory access is by default a failed proof. However, 
the code at MEMFAULT can be customized. This is particularly useful for proving 
the existence of bugs: P might wish to demonstrate that a particular input 
will cause a program to access invalid memory. Rather than hard-coding this 
behavior, we take a more general approach: the code at MEMFAULT can be set to 
QED, or to any arbitrary program. 

As a final detail, we mark a protected region in the middle of the mem- 
ory space that, upon access, triggers an interrupt. This ensures separation of 
the stack and the heap, and is a standard security technique used in cleartext 
systems. 


7 Our Microarchitecture 


We now formalize our microarchitecture, i.e. the implementation of our archi- 
tecture. 


Construction 2. Our Microarchitecture is the concrete implementation of the 
Architecture (Construction 1) described in this section. 


Our microarchitecture manipulates authenticated sharings of values in the 
field Z, for a given prime p (concretely, we choose p = 24° — 87). Construction 2 
instantiates the architecture components as follows: 


e The program counter pc € Zp is a sharing [pc]. 


e The instruction memory Z is implemented by an efficient ZK ROM as 
specified by [HK20a]. The ROM consumes O(logn) OTs per instruction 
lookup. Our physical instruction format is described in Section 7.4. 


e The registry R is implemented by BubbleCache (see Section 7.1) with 32 
slots. 


e The main memory M is implemented by BubbleCache with 2% slots for 
publicly agreed k. 


17 


The input tape is implemented by allowing P to provide a 32-bit value on 
every cycle. We emphasize that the input tape is not concretely represented 
inside ZK. However, if P is honest and runs our implementation (note, 
there is no requirement that she does: a malicious P can run arbitrary 
software), she maintains a cleartext input tape that maintains her input 
choices. 


The number of time steps 7 is instantiated by a publicly agreed cleart- 
ext integer. P does not use the number of time steps prescribed by our 
architecture: she must also pad J to account for cache misses. 


At initialization, pc, R, and M are zero-initialized using the protocol’s sup- 
port for constants. Similarly, Z is initialized with the constant program text. 

The processor performs 7 cycles. At the start of each cycle, P inputs a single 
bit that allows her to skip the current cycle, including updating the pc. This 
capability allows P to skip cycles when the processor would otherwise incur 
a cache miss (see Section 7.1). On a skipped cycle, RAM continues to make 
progress towards a cache hit. 

On each cycle, the processor loads instruction Z[pc] and performs all of the 
following actions: 


Load two values from R. 
Load a value from M. 
Read an input from P. 


Based on the instruction, multiplex (1) the second register value, (2) the 
loaded memory value, (3) the instruction immediate, and (4) P’s input. 
This computes the second loaded value (the first is the first loaded register 
value). 


Compute all ALU operations on the two loaded values (see Section 7.2). 
Compute all possible pc updates. 
Multiplex the ALU output and pc update based on the instruction opcode. 


Compute mod 2°? on the ALU output, ensuring that the ALU output 
(which is a Zp element) fits in the set Zos2. 


Finally, the processor performs all of the following: 


Update a value in R. 
Update a value in M. 


Update pc. 
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In many cases and depending on the instruction, each ‘update’ may leave the 
target untouched. 

We describe many of the details above in the following sections. At a high 
level, our microarchitecture is the relatively straight-forward implementation of 
our architecture, albeit with many optimizations. Our key claim is the following: 


Theorem 1. Assuming a collision resistant hash function, that the prime mod- 
ulus p > 237, and that |logp| > o, our microarchitecture (Construction 2) 
is a sound (with soundness error 277) and complete malicious-verifier Zero 
Knowledge machine that proves arbitrary ZK relations expressed in our ISA 
(Construction 1) in the OT-hybrid model. 


The proof of this theorem relies on the JKO framework [JKO13] and follows 
naturally (1) from discussion throughout this section, (2) from the security of 
the [HK20a] protocol, and (3) from the fact that we check P’s input values where 
appropriate. We defer a more formal treatment of the proof of Theorem 1 to 
Appendix D. 

We use ø = 40; increasing o will lead to a linear cost increase. 


7.1 BubbleCache 


[HK 20a] introduced BubbleRAM, a ZK-specific ORAM with high concrete per- 
formance. For a RAM with n elements, BubbleRAM consumes only 5 log? n 
OTs per access. On each access, BubbleRAM allows P to shuffle a portion of 
the RAM. By doing so she ensures that, on each access, the needed element is 
in the zeroth slot of RAM. P proves that her shuffles are correct by proving 
that each requested index is equal to the index stored in RAM. Permutations of 
large arrays are expensive to compute: BubbleRAM achieves each permutation 
via a Waksman permutation network [Wak68], a circuit that consumes nlogn 
OTs. However, BubbleRAM is carefully scheduled such that large permutations 
are infrequent. Thus, permutation cost is amortized across accesses. 

BubbleRAM’s design ensures that RAM is strictly correct: the parties are 
guaranteed to successfully look up the requested index. In this section, we 
show that BubbleRAM can be significantly improved by removing strictness. 
At a high level, we modify BubbleRAM such that memory is shuffled less often. 
Our modification, which we call BubbleCache, must therefore allow for cache 
misses. In practice, the frequency of cache misses depends on how aggressively 
we reduce shuffling and on the proof statement. While the frequency of shuffling 
can be chosen arbitrarily, we elect a strategy that results in O(logn) OTs per 
access both because of the asymptotic improvement and because it is effective 
in practice. 


7.1.1 BubbleRAM formal review 


BubbleRAM’s key primitives are oblivious partitions. A partition on an array 
of n elements allows P to select + elements and to permute the array such that 
n 


those 5 elements are in the first } array slots; the remaining 5 elements are 
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placed into the last } slots. Aside from partitioning these two halves of the 
array, the partition guarantees nothing about the relative order of the elements. 
An oblivious partition on n elements can be achieved by O(nlogn) OTs via a 
Waksman permutation network. 

Let N be the number of elements in RAM. BubbleRAM maintains a time- 
counter t. t is initialized to zero; after each access, t is incremented. The key 
idea is to carefully partition memory on a schedule according to t. On each 
access, BubbleRAM considers each power of two 2* < N starting from the 
largest such k and working down to k = 1. BubbleRAM allows P to partition 
the first 2” elements if 2*—1 | t, that is if 2*-! divides t. This strategy is based 
on the following argument. If P places the next 5 needed elements into the first 
5 slots, then we need not look at the last 5 slots for the next 5} accesses: it is 
impossible that a needed element would be in the back of the array. However, 
once we have exhausted 5 accesses, P must repartition memory to again account 
for the fact that needed elements might be at the back of RAM. BubbleRAM 
stores each RAM element in a tuple together with its index, allowing the CPU 
to efficiently check that P properly moves each needed element to slot zero. 


7.1.2 Adding Caching to BubbleRAM 


It is widely understood that, in practice, programs exhibit high data locality: 
once an element is used, it is likely that the same element will be needed again 
soon. As described so far, BubbleRAM ignores this fact. To take advantage of 
data locality, we will allow BubbleRAM to “miss” from time to time. We use this 
allowance to “spread out” the BubbleRAM schedule, reducing the number of 
partitions and hence the overall cost: we can partition the first 2% elements when 
f(k) | t for arbitrary f.° By varying f, we can trade between RAM performance 
(a sparse schedule is cheap) and cache miss-rate (in a sparse schedule, it is likely 
P will be unable to put the next needed element in slot zero at the correct time). 
By spreading out the schedule, we essentially transform BubbleRAM from an 
efficient array to an even more efficient memory space with log N levels of cache. 
Each cache level can be given its own schedule, allowing us to tune cost trade-offs 
at each level. 

Our processor accounts for cache misses by allowing P to inject no-ops: 
P predicts if the RAM will miss and, if so, instructs the processor to skip a 
cycle. By doing so, she skips BubbleRAM’s index comparison check, which she 
would otherwise certainly fail because the needed element is not in slot zero. 
During the skipped cycle, the RAM continues to increment its time counter and 
perform partitions, so the skipped cycle makes progress towards a cache hit. 
This preserves ZK, since no-op cycles are indistinguishable to V. 

Cache misses fit well in our CPU emulation-based architecture where no-ops 
can be easily injected. The same cannot be said of direct circuit representation- 
based architectures, where each potential miss must be conditionally handled. 


6Or we can use any Boolean valued function as our scheduler. We focus on periodic 
schedules because they are natural and effective. 
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This added handling further exacerbates the already challenging problem of 
circuit-based control flow. 

The value of cache-based memory is well understood in the world of cleartext 
computing. Our specific choice of f is loosely based on cleartext caching where 
each progressively larger level of cache is more expensive than the last, mapping 
to the idea that the more recently a piece of data was used, the more likely 
it will be needed again soon. One curious point is that, in terms of miss-rate, 
we outperform cleartext RAM since P has the rather unique opportunity to 
use Belady’s optimal page replacement algorithm to optimally program parti- 
tions [Bel66]. 


7.1.3 BubbleCache 


We now present our specific choice of schedule. Consider the function f which, 
for an input x, computes the nearest power of two greater than z: 


zt £20087] 
Let C be a constant. We use the following scheduler: 
(C-k-281) t | ¢ (1) 


That is, we delay partitioning by factor C - k (we vary C in Section 9; values 
near 4 yield high performance). The function + ensures that the partition period 
for level 2 divides the period for each level 7 > i. This prevents partitions at 
large levels of cache from ‘ruining’ partitions at low levels of cache (recall, we 
do not assume anything about the relative ordering within the two halves of a 
partition): at the time we partition a large level, we will subsequently partition 
all smaller levels. 


Construction 3. BubbleCache is BubbleRAM whose scheduler is configured 
according to Equation (1). 


It is difficult to argue analytically that logarithmic delay is a good choice: 
data locality depends on the application and the input, and so is highly variable. 
We choose logarithmic delay for two reasons: (1) it yields excellent per-access 
performance both asymptotically (see Theorem 2) and concretely and (2) in our 
experiments it yields low cache miss-rates. More efficient caching algorithms are 
possible, especially if we are willing to base the caching strategy on the target 
program. However, any improvements are likely to be small since BubbleCache 
has only log N cost with small constants. 


Theorem 2 (BubbleCache efficiency). BubbleCache consumes O(log N) OTs 
per access. 


Proof. By amortizing partition costs across accesses. 
Consider a single element as it moves through the levels of cache and eventu- 
ally ends up in slot zero. That element’s arrival in slot zero is arranged by log N 
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partitions, one for each power of two less than N. Each partition of n = 2% ele- 
ments consumes O(nlogn) OTs and is amortized over (C - k- 2*~') f accesses. 
Thus, we summarize the cost of a single access: 


REN ok . Jog(2*) REN k. ok 


k=1 k=1 


Thus, each access consumes O(log N) OTs. 


We use BubbleCache to implement both our main memory and our reg- 
istry. The caching mechanism greatly decreases the communication cost of each 
processor cycle, and we have found that the cache miss-rate is generally low. 


7.2 ALU: Efficient Arithmetic 


Our processor implements arithmetic instructions via an arithmetic logic unit 
(ALU). We defer a formal specification of the ALU to Appendix C. Here, we 
note interesting points: 

Our construction uses only a 40-bit prime field, as compared to the 64-bit 
field used by [HK20a]. This is a significant improvement: the sizes of messages 
transmitted by OTs are reduced. The key reason [HK20a] chose a 64-bit prime 
is because naïve multiplication of 32-bit numbers can yield a value as high as 
(232 — 1)”. A 64-bit prime field ensures that even this worst-case multiplication 
will not overflow. We designed a form of “bit-wise” multiplication that outputs 
a maximum value 32 - (23? — 1) < 23”: hence our choice of a 40-bit prime field 
is informed by security, not by correctness. 

Our ALU aggressively amortizes multiplications: the ALU, which handles all 
arithmetic instructions, needs only 65 OTs. Our ALU significantly contributes 
to our overall performance. 

Our architecture operates over 32-bit integers, but our protocol only supports 
a prime field. Thus, we must embed 32-bit integers in the field. We achieve this 
by computing mod 2°” on each ALU output. We do so cheaply: after the ALU 
runs, P provides input that subtracts the top bits off an overflowing value before 
we store the result in the output register. Although P chooses this input freely, 
this trick is automatically secure: our ALU always decomposes its arguments 
into 32 bits, and the validity of the decomposition is checked. This check passes 
only if P honestly subtracted off the top bits appropriately. This does mean a 
malicious P can store a non-32-bit value in the registry, but she cannot use this 
method to cheat: she cannot read a bad value from the registry without being 
caught. 


7.3 Multiplexing: Efficient Small Table Lookup 


Each cycle has two primary tasks: (1) updating the program counter and (2) 
updating the output register (storing to main memory is also an important ef- 
fect, but is handled separately). Each of the twenty instructions in our ISA 
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Figure 4: Our instruction physical format. Each instruction has six fields: (1) 
a single bit isImm that indicates if the instruction uses an immediate, (2) the 
instruction’s op-code, (3-5) up to three register arguments, and (6) the full 32- 
bit optional immediate. The first five fields are packed into one word (ten bits 
are left unused) while the immediate is held in a separate word. 


produce different such updates. Each cycle, we multiplex over the twenty pos- 
sible updates and only apply the current instruction’s update. Multiplexers 
can be trivially implemented by large numbers of multiplications, but this is 
somewhat inefficient. Here, we show an improvement that uses vector-scalar 
multiplication to more efficiently implement a multiplexer. The key idea is that 
we organize values into a small table, then use vector-scalar multiplication to 
recursively discard half the table until only one table entry remains. The full 
lookup procedure uses logn OTs to compute a lookup on n elements. 

Let T be a table of n elements 7}, ..., Tn/2, Tn/2+1; = In. Suppose P wishes 
to look up element i, and let 7; be the first bit of this index: 7, indicates if the 
needed element is in the first half or the second half of the table. To partially 
look up based on 71, the parties compute the following expression: 


(Ti, < Tny2) F ay i ((Taj2+1; aa Tn) = (Th, is A 78) 


If i; = 0, this expression computes (T,...,T,/2); otherwise, the expression 
computes (Tj,/241,+-;IZn). This expression requires a single OT of length n/2 
secrets. We can recursively narrow down to a single element using logn OTs. 
The total length of OT secrets is n. 

At the end of each cycle, we construct a table of three-tuples. Each tuple 
contains elements corresponding to one instruction: (1) the resultant program 
counter, (2) the resultant output-register value, and (3) the corresponding in- 
struction op-code. After we perform a table lookup, we (1) update the program 
counter, (2) update the output register, and (3) check that the instruction’s 
op-code matches the looked up op-code (see Section 7.4 for further discussion 
of this check). 


7.4 Instruction Format and Decoding 


On each cycle, the processor performs operations according to the current in- 
struction. Instructions are stored in an efficient read-only memory that requires 
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only 2log N OTs per lookup [HK20a]. However, simply searching for the in- 
struction is not sufficient. We must also ensure that the semantic action of the 
processor (how we update the program counter, registry, and main memory) is 
consistent with the instruction. 

Figure 4 shows our instruction physical format. Each instruction holds an 
opcode, up to three register arguments, an immediate, and a flag that indicates 
if the immediate is used. 

The first interesting aspect of our instruction format is that we store the 
immediate separately from the other fields. This makes reading the immediate 
trivial: no bit decomposition is needed to extract the immediate. 

Our registry is implemented by BubbleCache. Hence, on each access, the 
RAM yields not only the looked up element, but also that element’s index. 
Note also that the opcode is given as output from the cycle’s multiplexer (see 
Section 7.3). Finally, P chooses isImm as input during an OT that allows her 
to swap the third register argument for the immediate. Thus, each bit of the 
instruction’s first word is represented twice: once in the physical instruction, and 
once in the corresponding parts of the processor cycle. Therefore, checking that 
the processor’s action is consistent with the current instruction is extremely 
efficient: the parties use linear operations to compute a second copy of the 
instruction’s first word, then check that the two copies are equal. This means 
that the first word of each instruction is never decomposed into bits, a significant 
improvement over [HK20a], where a large number of OTs were needed to decode 
each instruction. 


7.5 Hardware Interrupts 


Recall, our architecture jumps to program location MEMFAULT when the program 
accesses an illegal memory address (Section 6.5). Checking that a given memory 
address is legal requires performing comparisons and is expensive. 

Fortunately, our processor supports skipped cycles, so we can handle inter- 
rupts cheaply: When an illegal access is attempted, P has no choice but to 
continually skip cycles. If she does not, BubbleCache will catch the attempt to 
index a non-existent address. With the processor successfully stuck, we need 
only make one change: our processor periodically checks if it is experiencing a 
fault (we check every 10,000 cycles). This check is still relatively expensive, 
but is performed so infrequently that the cost is nearly irrelevant. If the check 
indicates a fault, our processor sets pc to MEMFAULT, and P can continue the 
proof from the new state. 

Recall, our architecture maintains a protected region in the middle of mem- 
ory. This region is implemented by simply deleting addresses from Bubble- 
Cache’s address space. 
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Figure 5: Our system design. Software components that we implemented are 
marked with a star; other software is either provided at design time or is off 
the shelf. To use our system, the programmer needs only provide a C program. 
The C program is translated to our instruction set by our custom compiler. The 
programmer provides the resulting assembly to both V and P, who together run 
the program via the ZK protocol. When the protocol finishes, V either accepts 
or rejects the proof. As discussed in Section 6.4, P’s runtime interfaces with 
her local system via prover oracle calls. We provide a library of common oracle 
calls, including implementations of low level operations and of system calls. The 
programmer may optionally augment the oracle call suite by compiling custom 
code and dynamically linking it with P’s runtime. 
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8 ANSI C Implementation and System Design 


The architecture described to this point requires an expert to craft the needed 
assembly programs and oracle calls. Developers should instead be empowered to 
write ZK proofs in a familiar way. We achieve this goal by implementing ANSI 
C95, a familiar and complete programming environment. Our implementation 
consists of two major components: (1) our C compiler lifts our ISA and allows 
programmers to write natural programs, and (2) our runtime library allows 
programmers to access data through the C standard library (libc) interface. 

Our system is depicted in Figure 5. Users compose proofs as C programs that 
are compiled by our custom compiler, based on LLVM [LA04]. The resulting 
assembly runs on our ZKM. While running, the ZKM may communicate with 
P’s clear-text system through oracle calls. Users may define and compile custom 
oracle calls to further extend this capability. 


8.1 User Experience (UX) 


We view the compiler and runtime libraries as the user-interface to our ZKM. 
Without a convenient interface, it is impractical for users to construct non-trivial 
ZK statements. We highlight the goals of our UX: 


High-Level, Standard, Composable Programs Programmers should rep- 
resent their logic in a standard high-level language, and that logic should be 
composable. Programmers should be able to use existing code to quickly and 
succinctly develop complex statements. 


ZK execution that follows plaintext execution System and library calls, 
such as fread are transparently handled by our runtime library. When such 
calls are executed, honest P’s runtime uses an oracle call that ‘mirrors’ the 
call on her local system. So for example, if the ZK program attempts to read 
a file, P generates the appropriate ZK input by reading that file on her local 
system. This mirroring makes ZK execution ‘feel natural’: performance aside, 
it is difficult to detect differences between running a program in clear-text and 
running the same program on our ZKM in prover mode. 


Minimal instrumentation Standard C programs can be transformed into 
proof statements by simply adding the special instruction QED. Reaching QED 
constitutes a successful proof, so the programmer should place QED behind ap- 
propriate checks. 


8.2 C Compiler 


To support high-level, standard, composable programs, we provide a C compiler. 
Our compiler is built as a custom backend to the LLVM compiler toolchain, using 
the Clang C front-end. Clang translates C programs into LLVM’s intermediate 
representation (IR), LLVM optimizes that IR, and a backend translates the IR 
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into the target architecture’s instruction set. To support the ZKM, we add a 
new ZK backend that translates LLVM IR to our ISA. Much of this translation is 
well supported by our ISA, and while a significant effort, it is relatively straight- 
forward. However, there are several classes of operations in the C language that 
need special handling. 

First, there are division-like operations (divide, modulus, right-shift) which 
are not native to our ISA and must be emulated. We do so efficiently via oracle 
calls (cf Section 6.4). Our LLVM backend compiles division, right shift, and 
modulus operations into oracle calls with associated proofs. 

For example, consider logical right-shift: x >> y. Our compiler replaces 
right-shifts with appropriate oracle calls and associated proofs. Via the oracle 
call, P provides the right-shift result as input. Then, V checks in ZK that the 
value is correct (recall, P is untrusted). V checks two properties. First, P’s low 
32 — y bits must match the upper bits of x. Thus, the parties left-shift P’s bits 
by y and compare the result with the upper bits of x. Second, P’s high y bits 
must be zero. We use AND to zero out y’s low bits, and we check that the result 
is zero. With these checks complete, V is confident that P’s provided right shift 
result is correct. 

Second, LLVM and C have byte-addressable memory, but our ISA only sup- 
ports word-addressable memory. To allow sub-word granularity, our compiler 
translates sub-word memory accesses into full-word accesses. For loads, the 
compiler loads a word and masks out the needed bits. For stores, the compiler 
(1) loads the current value, (2) overwrites only the appropriate bits, and (3) 
performs a word-aligned store. With these operations, our compiler supports 
the full range of integer and pointer operations in the C language. 

Third, there are floating point operations. Our compiler currently emulates 
floating point by translating each floating point operation into a library call. 
These library calls emulate floating point via integer arithmetic. While correct, 
this approach is slow and could be improved by future work. 

Further improving our compiler, especially by incorporating optimizations, 
is an exciting direction for future work. We hope to add optimizations that take 
advantage of some of the ZK ISA’s strengths (e.g. large register set), and hide 
some of its weaknesses (e.g. word-aligned memory). We also hope to explore 
architecture-compiler co-design, such as that of VLIW architectures [Fis83] to 
construct more efficient ZK systems. 


8.3 Runtime System Support 


Real-world programs interact with an external system: they communicate with 
other programs, read files, output data. However, our ISA only supports a 
primitive input interface. We therefore provide runtime support that lifts these 
low-level operations into a C standard library interface (libc). Our libc en- 
ables the ZKM to interact with processes and files on P’s host system, allowing 
standard C programs, such as sed and gzip, to run on the ZKM without mod- 
ification. 
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We base our libc implementation on Newlib [RHS20]. Newlib builds the 
higher-level portions of libc on top of libgloss, a collection of the standard 
library’s lowest level behaviors. We provide a custom libgloss implementation, 
translating its sixteen system interaction functions, such as -read (the libgloss 
version of read), into oracle calls. Figure 6 lists a simplified version of our _read 
implementation, which provides an example of how our entire standard library 
is structured.. With libgloss implemented, Newlib provides the remaining 
standard library functionality, and thus we fully support libc. 


8.4 System Limitations 


Our system supports ANSI C95 programs. However, many applications rely 
on POSIX interfaces and/or features of newer C standards (C11, C17) beyond 
ANSI C95. In many cases, integrating these more complex interfaces, while pos- 
sible, would require significant additional engineering. In other cases, additional 
‘hardware’ support would be required. For instance, mprotect requires memory 
protection, requiring our ZKM to add an MPU. 

While the security of our ZKM is formally proven, the correctness of our 
compiler is not. Consequently, a skeptical P or V should inspect the compiled 
assembly. One possible future direction is to formally verify portions of the 
compiler [Ler09]. 

Our libc implementation does not validate data provided by the external 
system: such input is seen as part of P’s witness. For instance, a database 
application may assume it always reads from a well-formed database. Our libc 
will only return data read from the file, but will not check its contents. If 
an application has input constraints, it should include logic that checks those 
constraints. Put another way, only the application itself is authenticated in ZK. 
The execution environment is fully under P’s control, and she can arbitrarily 
and adaptively alter it: e.g., she can change files, delete files, or start/stop 
interacting processes. We leave embedding more of the execution environment 
in the ZK proof as future work. 

ORAM size is an important consideration: BubbleCache is stored in corre- 
sponding data structures on both P and V’s systems, and so the size-n Bub- 
bleCache is bounded by the hardware RAM of their systems. Additionally, the 
largest permutations employed by BubbleCache impose significant overhead: 
the largest permutation requires P and Y to execute tnlogn OTs. Further, we 
perform all OTs at the start of the protocol. This can be taxing for weak P/V. 

Natural directions for future work include more efficient floating point sup- 
port, compiler optimizations, and adding a linker, among others. 


9 Evaluation 


In this section, we describe the implementation and evaluation of our system. 
We emphasize our system’s concrete performance: our system runs at up to 
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11.89KHz with a substantial RAM (see Figure 10 for detailed performance re- 
ports, including clock rate, on several benchmarks). This is a ~ 5.5x improve- 
ment over the clock-rate achieved by [HK20a] for a more expressive instruction 
set. 


9.1 Implementation Details and Testing Environment 


We implemented our approach in C/C++. Our front-end compiler and standard 
library together are ~ 6KLOC. Our backend cryptographic ZK machine im- 
plementation is ~ 3KLOC. Our compiler is based on LLVM version 10.0.0 
and newlib 3.3.0. We instantiated Oblivious Transfer using the recent Ferret 
correlated OT [YWL* 20] as implemented by the EMP Toolkit [WMK16]. All 
benchmarks were run on a commodity laptop: a MacBook Pro with an Intel 
Dual-Core i5 3.1 GHz processor and 8GB of RAM. The two parties communi- 
cated over a simulated 1Gbps LAN with 2ms latency. Unless otherwise stated, 
we instantiate BubbleCache using delay parameter C = 5. 


9.2 Bug Analysis and Detection Methods 


In our first experiment, we demonstrate our system’s ability to handle real-world 
programs. We pulled buggy versions of two popular Linux programs, sed and 
gzip, off the shelf. We added minimal instrumentation by inserting QED calls 
at appropriate locations. Our system successfully proved in ZK the existence of 
the bug in each program. 


9.2.1 sed 


Older versions of sed (we use 1.17) contain a segmentation fault bug listed in the 
Software-artifact Infrastructure Repository (SIR) [DER05]: omitting the final 
newline character from the input file results in a length underflow, which causes 
sed to attempt to overwrite its entire address space. This version of sed has 
5.4KLOC. The bug exercises undefined behavior and could exhibit itself in a 
number of ways. Our architecture detects this bug through our memory protec- 
tion mechanism. We configure our memory hardware interrupt, MEMFAULT, to 
hold QED (see Section 6.5), causing a fault to conclude the proof. The hardware 
interrupt is triggered when sed attempts to write to the protected region of 
memory between the stack and heap. 

We instrumented BubbleCache with 213 words of memory. QED is reached 
after 390,002 cycles, shortly after completion of a hardware interrupt. The 
proof ran in 36.1s, yielding a clock-rate of 11.1KHz. The execution incurred 
6,470 cache misses and 45,912 skipped cycles. 


9.2.2 gzip 


The bug (CVE-2005-1228 [CVE05]) allows an attacker to illegally write to an 
arbitrary directory. When gzip decompresses a zip file, the output directory 
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is intended to be named according to a prefix of the input file name. Under 
certain inputs, gzip will erroneously write to an arbitrary directory chosen by 
the attacker. We used gzip 1.3, which is believed to be the last version with 
this bug. This version of gzip has 5.4KLOC. We detect the bug by placing 
string comparison logic immediately before opening an output file. 

We used BubbleCache with 2!” words of memory and C = 5. The proof uses 
44, 092 cycles and runs in 6.5s, yielding a clock-rate of 6.8KHz. gzip runs slower 
than sed because we use a larger RAM and do not issue enough instructions to 
fully amortize the largest BubbleCache partitions. The execution incurred 554 
cache misses and 2, 868 skipped cycles. 


9.3 BubbleCache Performance 
9.3.1 Comparison to BubbleRAM 


We compare the communication cost of BubbleCache to that of BubbleRAM. 
We ran the trivial program, a single QED instruction, but ensured that the proof 
ran for sufficient cycles to fully amortize RAM cost. RAM cost is program inde- 
pendent: the processor permutes and accesses RAM on every cycle, regardless of 
the program. We instantiated RAM with both BubbleCache and BubbleRAM 
of various sizes. Figure 7 plots the resulting RAM communication consump- 
tion. The results clearly show BubbleCache’s asymptotic advantage. With 217 
memory words, BubbleCache improves communication by more than 8x over 
BubbleRAM. 


9.3.2 Cache Miss Rate and Impact 


BubbleCache introduces the possibility of a cache miss. To better understand 
BubbleCache performance, we ran our system with different configurations of 
RAM and against different benchmarks. A full tabulation of the results is 
in Appendix E. Figure 8 gives a high level summary of the results: in our 
benchmark suite, despite the fact that RAM is only a single component of our 
processor step, using BubbleCache provides substantial improvement. 


9.4 Prover Oracle Call Shortcuts 


As discussed in Section 6.4, oracle calls can shortcut the evaluation of functions 
that are expensive to compute but cheap to verify. We demonstrate this with a 
concrete example. We implemented both merge sort and ‘input and prove’ sort. 
In the latter, P locally sorts the array, then provides the sorting permutation as 
input. In ZK, the players check that the input is indeed a permutation, apply 
the permutation, and check the list is sorted. This algorithm uses only O(n) 
ZK computation. 

We ran the two algorithms on lists of 10, 100, and 1000 elements. ‘Input 
and prove’ sort used respectively 1.47x, 3.92, and 6.35x fewer ZK instructions 
than merge sort. 
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9.5 Maximum Memory Size 


In our benchmarks reported in Figure 10, we configured RAM with the mini- 
mum size needed to correctly execute each program. We did so to demonstrate 
maximum performance. However, our system can also be configured with much 
larger RAM. Thus, we experimented with the maximum size of RAM that our 
system can handle. We ran our gzip experiment with 274 words, or 32MB, of 
main memory. The proof completed successfully in 199s. The proof is substan- 
tially slower than when run with 21” words because of large partitions performed 
on the first RAM access. When we attempted 27° words, our modest laptop 
exhausted its memory resource and the program crashed. With programming 
improvements or by using better hardware, we could scale to larger main mem- 
ory. 
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A Succinct and Non-Interactive ZK Related Work 


Ishai et al. introduced the ‘MPC-in-the-head’ technique [IKOS07]: here, P emu- 
lates in her head an MPC evaluation of the proof among virtual players. V checks 
random portions of the evaluation transcript and thus gains confidence that the 
prover has a witness. By allowing V to inspect transcripts of only some virtual 
players, the protocol protects P’s secret. [IKOSO07] inspired a number of subse- 
quent MPC-in-the-head works, e.g. [GMO16, CDG*17, AHIV17, KKW18}. 

Succinct non-interactive arguments of knowledge (SNARK) are extremely 
small proofs that can be quickly verified [GGPR13, PHGR13, BCGT13, CFH™ 15, 
Grol6]. Early SNARKs require a semi-trusted party, so more recent works de- 
veloped STARKs (succinct transparent arguments of knowledge) [BBHR18]. 
STARKs do not require trusted setup and rely on more efficient primitives. 
[BFH* 20]’s Ligero++ combines techniques of [AHIV17, BCR*19]. 

[HK20b] extensively analyzed many of the above works: namely [KKW18], 
Ligero, Aurora, Bulletproofs, STARK, and Libra [KKW18, AHIV17, BCR*19, 
BBB*18, BBHR19, XZZ*19]. Their analysis demonstrates that while these 
works have excellent performance in certain aspects (e.g., small proof size, fast 
verification time, non-interactivity), they struggle to handle large proofs moti- 
vated by problems like proving the existence of a bug in a program. 


B [HK20a] Protocol, Extended 


Section 5 introduced [HK20a]’s arithmetic ZK protocol and its authenticated 
sharing representation. Here, we continue by showing how the protocol primi- 
tives are implemented. 


Proofs of zero and of using correct inputs P must frequently prove that 
her intermediate inputs are chosen correctly (cf, e.g., Section 6.1.4). This is 
achieved by proving equality of corresponding values, which, in turn is achieved 
by proving that particular shares each encode zero. This is simple: P sends 
her share to VY and VY checks that the two shares sum to zero. We use a simple 
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optimization: rather than sending each proof of zero separately, P instead ac- 
cumulates a hash digest of all proofs and sends this to V. Thus, P needs only 
send « bits to V to prove the validity of every zero check. 


Linear Operations Sharings support efficient linear operations. Linear op- 
erations do not require communication; the parties need only operate locally on 
their respective shares. 


e To add sharings, parties locally add the shares: 
[x] + [y] = (X, zA — X) + (Y,yA—-Y) 
AUX +Y, (e+ yA-(X4+Y)) = [z +4] 


To subtract sharings, parties similarly subtract the shares. 


e To multiply a sharing by a public constant, the parties locally multiply 
the shares by the constant: 


cle] = c(X, cA — X) £ (cX, cxA — cX) = [ca] 
e Parties encode constants as follows: (cA, 0) = [e] 


Vector-Scalar Multiplication [HK20a]’s protocol provides non-linear oper- 
ations via a special form of share multiplication where one share is known to 
be in {0,1}. We give a simple generalization where instead P scales an entire 
vector by a private bit. This generalization can achieve [HK20a]’s multiplica- 
tion by incorporating public constants and a zero proof; in fact, the composed 
protocol is identical to [HK20a]’s. Our extension also allows other usage, such 
as fast small table lookup (Section 7.3). We note that [HK20a] did mention 
some special case usage of vector-scalar multiplication, but did not frame the 
formalization as generally as the following: 

Let x € {0,1} be held by P and let y1,...,ym E Zp be a vector. The parties 
hold sharings [y1], ..., [yn] and wish to compute [æy], ..., [zyn] (while P’s input 
x is not authenticated, it could be verified later by an appropriately applied proof 
of zero). First, P locally multiplies her shares by x. Thus the parties together 
hold: 

(Y1, 7yiA — xY1),..., (Yn, ynA — Yn) 


These intermediate sharings are invalid: the shares in the ith sharing do not 
sum to y;A. To resolve this, the parties participate in a single 1-out-of-2 OT 
where V acts as the sender. V uniformly draws n values Y/ €g Zp and allows P 
to choose between the following two vectors: 


Vinay Y -Yna ay, 


P chooses based on x and hence receives as output the vector Y/—xYj,..., Y! — 
xY,. The parties can now compute a valid sharing for each vector index: 


(Yj, zyiA — 2Y; — (Y; — 2Y;)) = (Yj, cyiA — Y/) = [xy] 
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A vector-scalar multiplication of a length n vector requires a 1-out-of-2 OT of 
n{log p|-bit secrets. Since the transmitted terms Y/ are random, it suffices to 
use correlated OT [ALSZ13, KK13]. In practice, we instantiate multiplication 
with the Ferret correlated OT technique [YWL* 20]. 

Security. We stress that there is no security consequence in using the 
above generalization of [HK20a]’s multiplication primitive: [HK20a]’s correct- 
ness, soundness, and verifiability proofs of the underlying garbling scheme are 
each trivially adjusted to accommodate this more powerful primitive. Thus, the 
extended share algebra of [HK20a] is a malicious-verifier ZKP system. 


C Arithmetic Logic Unit 


Figure 9 lists the core of the ALU. Our ALU aggressively amortizes multiplica- 
tions: in total, our ALU uses only 65 OTs. The listed code does not handle MOV 
(which is trivial) or CMOV (which is handled by one additional OT). The ALU 
outputs are multiplexed by a small table lookup (see Section 7.3). 

ADD, SUB, and MUL can each overflow Zs2. The highest overflow comes from 
our multiplication algorithm, which can produce values as high as 32 - (23? — 1). 
We account for overflow by allowing P to input a value x € {0..31}. The 
parties subtract 2°? . |x] from the ALU output, allowing P to subtract. off 
the top bits. At first, this may seem insecure. However, note that the ALU 
bit decomposes each input register on each cycle, and the validity of these 
decompositions are checked. These checks can succeed only if the registers hold 
valid Zo32 elements. Thus, it is possible for a malicious P to store an element 
not in Zə32 in the registry, but she cannot operate on such an illegal element 
without being caught. 


D Proofs 


Section 7 claimed Theorem 1: our microarchitecture is sound and complete. We 
now formalize in detail and prove this claim. Let x be the computational and 
o be the statistical security parameters (e.g., k = 128,0 = 40). 

To formalize security, we must capture all of P’s OT choices. Therefore, we 
define the notion of an extended witness, which includes not only the program 
input, but also all ‘supporting’ OT choices needed to complete the proof (e.g., 
during bit decomposition P selects OT outputs corresponding to the decom- 
posed integer). 


Definition 1 (Extended Witness). The extended witness is the complete col- 
lection of P’s inputs, including supporting choices that are not syntactically part 
of her witness. 


Definition 2 (Proof Microarchitecture). A proof microarchitecture u is a pro- 
tocol (P,V). Let P be a program in our ISA. P takes as input P and an extended 
witness I. V takes as input P. At the end of the protocol, V outputs O or 1. 
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Definition 3 (Proof Microarchitecture Completeness). A proof microarchitec- 
ture completely implements our architecture if for all programs P in our ISA 
and all extended witnesses I such that P(I) reaches QED, P(P,I) causes V to 
output 1. 


Theorem 3. If the prime field modulus p is greater than 2°", then our microar- 
chitecture (Construction 2) is a complete implementation of our architecture 
(Construction 1) in the OT-hybrid model. 


Proof. By inspection of the microarchitecture (Section 7) including the correct- 
ness of BubbleCache (Section 7.1), of [HK20a]’s oblivious ROM, of small table 
lookup (Section 7.3), and of our ALU (Figure 9). Our microarchitecture is built 
on [HK20a]’s share algebra, which, given that OT is correct, is complete. At a 
high level, our microarchitecture is a relatively straightforward implementation 
of the architecture. We mention several of the interesting points. 


e BubbleCache introduces cache misses, which are not part of our architec- 
ture. We nevertheless correctly implement the architecture because (1) P 
skips cycles that would otherwise experience a cache miss and (2) because 
BubbleCache makes progress towards a cache hit during skipped cycles: 
our processor cannot become stuck. 


e The architecture includes hardware interrupts that are issued on illegal 
memory accesses. Our microarchitecture does not immediately issue hard- 
ware interrupts. However, P skips cycles in order to simply wait until we 
periodically check for an illegal memory access. Because we periodically 
check, we always make progress towards an interrupt if the processor is in 
a fault state. 


e Our architecture operates over values in Z2s2, but our microarchitecture 
operates over values in Zp. However, we compute mod 23? on every cy- 
cle, and thus while our representation can capture any value in Zp, our 
registry and main memory will never store a value > 2°”. Additionally, no 
operation will overflow the field: multiplication has the largest outputs, 
with a maximum output value of 32 - (23? — 1) < 29”. 


e Our architecture performs only one instruction per cycle, but our microar- 
chitecture must emulate every instruction on every cycle. This difference 
is easily accounted for by (1) issuing dummy memory look-ups when a 
register/address is not needed, (2) having P provide a dummy input on 
cycles when input is not needed, and (3) multiplexing (via small table 
lookup) the effect of all instructions based on the current instruction op- 
code. Thus, while we compute the effect of each instruction, we only apply 
the effect of the active instruction. 


Our microarchitecture is complete. 
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Even though our microarchitecture is described in the OT-hybrid model, 
we still need computational assumptions: we perform zero-tests by comparing 
hashes of a set of zero labels. 

To define soundness, we define a syntactic interface to the prover adversary 
A. A takes the program P, the extended witness J, and her share of I as input: 
A(P,I,[I]). A’s interface is different from P’s: A takes as input the already- 
encoded shares [J]. That is, A does not participate in the OT step of u, and 
instead directly receives P’s shares of I. 


Definition 4 (Proof Microarchitecture Soundness). A proof microarchitecture 
u soundly implements our architecture if for all programs P in our ISA, all ex- 
tended witnesses I such that our architecture does not reach QED within poly(1") 
cycles, and all probabilistic poly-time adversaries A, the probability that A(P, I, [I]) 
causes V to output 1 is negligible in k. 


Similar to [JKO13]’s definition of soundness, the above ensures that mali- 
cious P cannot win unless she has input labels corresponding to J such that 
P(I) terminates in QED. 


Theorem 4. Assuming a collision resistant hash function, that the prime field 
modulus p > 23", and that |log p| > o, our microarchitecture (Construction 2) 
is a sound (w.r.t. o) implementation of our architecture (Construction 1) in 
the OT-hybrid model. 


Proof. By soundness of [HK20a]’s share algebra extended with our generalized 
vector-scalar multiplication protocol. [HK20a]’s share algebra can be framed as 
a privacy-free garbling scheme, and hence is sound via [JKO13]’s protocol. 
Our microarchitecture is built on share algebra, so primitive operations are 
sound. We introduce opportunities to cheat by allowing P to freely choose 
certain inputs, but we account for each of these opportunities by forcing P to 
prove certain values are equal to zero (as an optimization, we use a collision- 
resistant hash function to compute and verify only a digest of all zero proofs): 


e To compute mod 232, P freely subtracts the top bits from the ALU out- 
put. Suppose P cheats and subtracts top bits that do not correctly com- 
pute mod 232, The ALU output is subsequently stored in the registry. 
The next time the invalid register is accessed, it will be decomposed into 
32 bits, and the validity of the decomposition will be checked. Since the 
register value does not fit in 32 bits, this check will fail and V will reject. 


e Each cycle, the processor updates pc and the registry by multiplexing 
over the effects of all possible instructions. To multiplex these effects, 
P freely chooses a row of a small lookup table. Suppose P cheats and 
chooses a row that does not correspond to the current instruction. One 
of the columns of the table is the instruction opcode. Before the cycle 
terminates, P must prove that the looked up opcode (together with the 
accessed registry indices) matches the current instruction. Since P cheated 
and does not have the correct opcode, she cannot pass this check. 
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P cannot cheat by skipping cycles. When a cycle is skipped, the processor 
makes no progress, except to partition the content of BubbleCache. 

P can cheat by guessing a share that does not arise from the program ex- 
ecution. Most directly, she can guess a pc share that moves the processor to 
QED. As discussed in Section 5, this guess succeeds with probability i and 
hence only succeeds with probability negligible in ø. Even if there are multiple 
QED instructions, P’s chances are not improved: she not only has to reach QED, 
but also needs to know which specific pc value she reached such that she can 
appropriately prove the current pc value is in the ROM. This is a particular 
case of a more general property of the protocol: to cheat, P must both know a 
valid share and know the corresponding clear-text value. 

Our microarchitecture is sound. 


Theorem 5. Our microarchitecture is malicious-verifier Zero Knowledge. 


Proof. Immediate. The [HK20a] protocol is malicious-verifier Zero Knowledge, 
as proved by [JKO13]. 


Theorems 3 to 5 together imply Theorem 1. 


E Cache Miss Rate, Extended 


The Figure 8 statistics are extended in Figure 10. 
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#define READ_ORACLE 1 
#define FD_REGISTER ... 
#define LEN_REGISTER ... 


int _read(int fd, char *buf, int len) { 
proveroracle(READ_ORACLE, fd, len); 
int rc = readtape(); 
if (rc > len) { HALT; } 
for (int i = 0; i < rc; i++) { 
buf [i] = readtape() & OxFF; 
} 
return rc; 


} 


void read_orc(state* s) { 
int fd = s->registry[FD_REGISTER] ; 
int len = s->registry[LEN_REGISTER] ; 
char data[] = new char[len]; 
// calls Linux’s read function 
int rc = read(fd, data, len); 
writetape(s->input, rc); 
for (int i = 0; i < rc; i++) { 

writetape(s->input, datal[i]); 

} 
delete[] data; 

} 

REGISTER_ORACLE(READ_ORACLE, &read_orc) ; 


Figure 6: The ZK procedure for a file-system read (top) and the corresponding 
oracle call procedure run only by P (bottom). _read is run explicitly by the ZK 
machine. When P executes this procedure and reaches the proveroracle call, 
she temporarily escapes the ZK processor, runs the procedure read_orc on her 
local machine to populate the machine’s input tape, and then re-enters the ZK 
processor. In contrast, V performs a no-op at the proveroracle call. Impor- 
tantly, -read checks that P provided a valid number of bytes. REGISTER_ORACLE 
is a helper procedure that lets the programmer extend ?’s oracle call table 
Note, because oracle calls (e.g., read_orc) are run in cleartext on P’s machine, 
they are compiled separately from the proof statements and can be written in 
C++, 
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Figure 7: BubbleCache’s and BubbleRAM’s amortized communication con- 
sumption per cycle as a function of the size of RAM. Each OT swaps two 
pairs of words and sends 81 bits (40 per pair element and 1 overhead bit). 


Benchmark RAM Size | Time Comm. Clock rate 
(Words) 

Sort (500) 215 | 1.78x 1.44x 1.92x 

Sum (5,000) 215 | 1.34x 1.54x 1.36x 

sed 213 | 1.26« 1.29x 1.40x 

gzip 217 | 1.15x 1.31x 1.23x 


Figure 8: Overall system performance improvement when implementing RAM 
with BubbleCache versus with BubbleRAM. We tabulate (1) wall-clock time re- 
duction, (2) total communication (including non-RAM actions) reduction, and 
(3) clock-rate (i.e., Hz) increase. We configured ZKM, setting RAM to both 
BubbleCache and BubbleRAM for each of four benchmarks: calling merge sort 
on 500 random numbers, summing a list of 5,000 random numbers, gzip, and 
sed bug proofs. An expanded table with additional metrics is given in Fig- 
ure 10. 
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ALU([2], [y] : 
[z] <— [1] © z will in the end denote x = 0 


for i € 0..31: 
({1 — zil k) <— (1 —2;)- (1, [2]) > Decompose x into bits and check if x is zero. 
[pow] +- [1] > pow will in the end denote 2” mod 2°” 
for i € 0..31: 
Decompose y into bits, compute bit operations, multiply (with overflow), and compute 2” 
31—i 
2x mod 2°] = DD 27i . [xy] 
j=0 
ôpow] — (29 mod 23) — 1) - [pow] 
[yi]. [eA ye], [yi (2x mod 2°), [Spow]) — ys > (LU, [zi], Bir mod 2°], [5,0u]) 
(eV yji] = [ei] + [ye] - [ery] 
(cx@y)i] — [eV yll- [z ^y] 
pow] + [pow] + [ôpow] 


31 31 
prove |= (>: 2 2) =[0] ; prove [y] — 2 2" 2) = [0] 
i=0 i=0 


31 
return ([æ + yl, [2° +- yl, (Ei - (2'x mod =) , [e @ yl, [e A yh, [ev yl, Il, [v31]; pow) 
i=0 


Figure 9: The core of our ALU computes ADD, SUB, MUL, XOR, AND, OR, EQZ, MSB, 
and POW2. We denote the ith bit of a value x by writing x;. If a variable appears 
outside a share and is not a constant, it indicates that this is a choice made by 
honest P. The full algorithm consumes only 64 OTs. Addition, subtraction, 
and multiplication all include the possibility of an overflow, which is dealt with 
elsewhere in the processor. 


44 


Bench- RAM Instrs. # Cycles Time Cache IPC Comm. Millions Clock 
mark Access (s) Misses (MB) of OTs (KHz) 
BubbleRAM 524399 99.0 0.00% 1.000 1755 144.0 5.30 
BubbleCache 1/6 524399 65.9 0.00% 1.000 1409 108.0 7.96 
Sort BubbleCache 1/3 527053 53.9 0.49% 0.995 1212 87.4 9.78 
(500) BubbleCache 1/2 | 524399 135611 565503 55.6 5.57% 0.927 1216 85.0 10.17 
BubbleCache 2/3 593171 67.3 5.92% 0.884 1249 86.5 8.81 
BubbleCache 5/6 809246 86.6 20.43% 0.648 1674 115.0 9.34 
BubbleCache 1 828633 87.3 20.52% 0.633 1708 117.0 9.50 
BubbleRAM 280025 34.4 0.00% 1.000 939 77.0 8.15 
BubbleCache 1/6 280025 28.5 0.00% 1.000 754 57.8 9.83 
Chace BubbleCache 1/3 280039 25.2 0.01% 1.000 647 46.7 11.11 
(5000) BubbleCache 1/2 | 280025 76377 283077 25.6 2.78% 0.989 611 42.8 11.05 
BubbleCache 2/3 283333 24.4 2.79% 0.988 598 41.4 11.63 
BubbleCache 5/6 381466 32.1 16.68% 0.734 790 54.3 11.89 
BubbleCache 1 382554 32.6 16.70% 0.732 788 54.0 11.74 
BubbleRAM 350002 45.4 0.00% 0.983 1130 89.7 7.71 
BubbleCache 1/6 350002 37.6 0.00% 0.983 973 73.4 9.30 
BubbleCache 1/3 350002 35.1 1.02% 0.983 854 61.1 9.98 
sed BubbleCache 1/2 | 344086 56525 390002 36.1 11.45% 0.882 874 60.8 10.81 
BubbleCache 2/3 400002 37.1 11.91% 0.860 943 65.1 10.79 
BubbleCache 5/6 530002 53.1 27.94% 0.649 1213 83.2 9.98 
BubbleCache 1 540002 54.9 28.10% 0.637 1224 83.8 9.83 
BubbleRAM 41224 7.5 0.00% 1.000 210 17.1 5.53 
BubbleCache 1/6 41224 6.9 0.00% 1.000 177 13.7 6.01 
BubbleCache 1/3 41458 6.3 0.76% 0.994 162 12.2 6.53 
gzip BubbleCache 1/2 41224 6687 44092 6.5 8.28% 0.935 161 11.9 6.82 
BubbleCache 2/3 45628 6.5 8.78% 0.903 162 11.9 7.02 
BubbleCache 5/6 60643 7.3 29.85% 0.680 186 13.5 8.28 
BubbleCache 1 62499 7.7 29.98% 0.660 188 13.6 8.15 


Figure 10: Performance characteristics of our system with various RAM con- 
figurations and on various benchmarks. We record (1) the total number of 
instructions needed to reach QED, (2) the number of LOAD/STORE instructions in 
the execution, (3) total cycles (including skipped cycles), (4) the total execution 
time, (5) the cache miss rate (i.e., the ratio of missed loads/stores to LOAD/STORE 
instructions), (6) the number of instructions executed per cycle (IPC), (7) the 
total communication in MB (8) the number of OTs in millions, and (9) our 
system’s clock-rate in KHz. Note that each cache miss can be responsible for 
more than one skipped cycle. Hence, cache miss rate and IPC can differ. For 
example, the sed bug causes one cache miss for BubbleRAM on the illegal ac- 
cess, but that single miss incurs almost 6,000 skipped cycles while waiting for a 
hardware interrupt. We ran each of four benchmarks: merge sort on a list of 500 
random numbers, summing a list of 5,000 numbers, our sed bug benchmark, 
and our gzip bug benchmark. We ran each benchmark using BubbleRAM and 
using BubbleCache configured with various delay constants C. 
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